This case study is part of my Google Data Analytics Capstone Project, where I’ll guide you through the six phases of Data Analytics: Ask, Prepare, Process, Analyze, Share and Act. Throughout these phases, I’ll demonstrate my ability to identify business tasks and key metrics, clean and process data for analysis, apply appropriate statistical techniques for data analysis and create interactive visualization using ggplotly. The final steps, including cleaning, analysis and visualization of data, will be executed using R Studio.
In 2016, Cyclistic launched a bike-share program in Chicago, which has since expanded to include 5,824 bicycles across 692 stations. Cyclistic offers flexible pricing plans, attracting a wide range of customers. However, finance analysts have found that annual members are more profitable than casual riders. To boost future growth, the marketing team aims to convert casual riders into annual members. To achieve this, they need to understand the differences in bike usage between the two groups.
Identify marketing strategies to convert casual riders into annual members. To do this, it is important to analyze how annual members and casual riders used Cyclistic bikes differently using the previous 12 months Cyclistic’s historical data.
Metrics that could be used to compare the ride patterns between casual and member riders:
The key stakeholders for this project include the Director of Marketing, the Cyclistic executive team, and the marketing analytics department.
For this analysis, data from the past 12 months (August 2023 - July 2024) was utilized. This data was stored on AWS cloud. Each monthly dataset contains essential details, including membership status (casual or member), type of bike used (classic, docked & electric), trip start and end times, and information about the start and end stations, along with their latitude and longitude coordinates.
We will use the following R packages:
To begin processing and cleaning the data, we will first import and merge the data sets.
Step 1: Import and merge datasets
# Define the path to the directory containing the CSV files
path <- "C:/Users/Smrut/OneDrive/Documents/Raw Data/Data"
#path <- "C:/Data Analytics Portfolio/Cyclistic Case Study/Subset_data"
# Get a list of all CSV files in the directory
files <- list.files(path = path, pattern = "*.csv", full.names = TRUE)
# Use lapply to read all files into a list of data frames
data_list <- lapply(files, read.csv)
# Optionally, combine all data frames in the list into one data frame
combined_data <- do.call(rbind, data_list)
# Dimensions of the data frame
dim(combined_data)
## [1] 5699639 13
# See the first 6 rows of the dataset
head(combined_data, n=6)
## ride_id rideable_type started_at ended_at
## 1 011C1903BF4E2E28 classic_bike 2023-09-23 00:27:50 2023-09-23 00:33:27
## 2 87DB80E048A1BF9F classic_bike 2023-09-02 09:26:43 2023-09-02 09:38:19
## 3 7C2EB7AF669066E3 electric_bike 2023-09-25 18:30:11 2023-09-25 18:41:39
## 4 57D197B010269CE3 classic_bike 2023-09-13 15:30:49 2023-09-13 15:39:18
## 5 8A2CEA7C8C8074D8 classic_bike 2023-09-18 15:58:58 2023-09-18 16:05:04
## 6 03F7044D1304CD58 electric_bike 2023-09-15 20:19:25 2023-09-15 20:30:27
## start_station_name start_station_id
## 1 Halsted St & Wrightwood Ave TA1309000061
## 2 Clark St & Drummond Pl TA1307000142
## 3 Financial Pl & Ida B Wells Dr SL-010
## 4 Clark St & Drummond Pl TA1307000142
## 5 Halsted St & Wrightwood Ave TA1309000061
## 6 Southport Ave & Wrightwood Ave TA1307000113
## end_station_name end_station_id start_lat start_lng end_lat
## 1 Sheffield Ave & Wellington Ave TA1307000052 41.92914 -87.64908 41.93625
## 2 Racine Ave & Fullerton Ave TA1306000026 41.93125 -87.64434 41.92557
## 3 Racine Ave & 15th St 13304 41.87506 -87.63314 41.86127
## 4 Racine Ave & Belmont Ave TA1308000019 41.93125 -87.64434 41.93974
## 5 Racine Ave & Fullerton Ave TA1306000026 41.92914 -87.64908 41.92557
## 6 41.92884 -87.66387 41.90000
## end_lng member_casual
## 1 -87.65266 member
## 2 -87.65842 member
## 3 -87.65663 member
## 4 -87.65887 member
## 5 -87.65842 member
## 6 -87.64000 member
remove(files)
remove(data_list)
Step 2: Process and clean data using data wrangling techniques. We need to retain records with complete membership status and valid start and end stations, ensuring all variables are converted to their appropriate data types.
# install.packages("tidyverse")
# install.packages("lubridate")
# Load the packages
library(tidyverse)
library(lubridate)
trips_data <- combined_data %>%
filter(member_casual != '' & start_station_id != '' & end_station_id != '') %>% #excluding missing records with subscription status, start and end station details
rename(bike_type = rideable_type) %>% #rename column as bike_type
mutate(bike_type = factor(bike_type, levels = c("docked_bike", "classic_bike", "electric_bike"), labels = c("Docked bike", "Classic bike", "Electric bike")), #assigning value labels to bike_type
member_casual = factor(member_casual, levels = c("member", "casual"), labels = c("Member", "Casual")), #assigning value labels to member_casual
started_at = ymd_hms(started_at), # converting char to date time
ended_at = ymd_hms(ended_at)) %>% # converting char to date time
filter(start_lat < 42.02 & end_lat < 42.02) #excluding latitude with top coordinates greater than 42.0 since these communities are outsider Chicago boundaries
# Remove the combines_data to free up disk space
rm(combined_data)
Step 3: Identify and remove duplicate rows based on the ride_id and start time to ensure data accuracy and consistency.
## Check for duplicates based on ride id (CC numbers which could be repeatedly used by a rider hence may not be unique per ride), start_station and start time
trips_data_1 <- trips_data[!duplicated(trips_data$ride_id , trips_data$started_at ), ]
rm(trips_data)
Step 4: Calculate new variables for analysis: * Trip duration in mins (ended_at - started_at) * Time, Date, month and year from the start date
#install.packages("hms") to extract the time from start date time variable
library(hms)
trips_data_2 <- trips_data_1 %>%
mutate(duration_trip = as.numeric(difftime(ended_at, started_at,units = c("mins"))), #Trip duration as diff between (ended_at - started_at)
time_trip = as_hms(started_at),
day_trip = day(started_at), #extract date from start date
month_trip = month(started_at, label = TRUE, abbr = FALSE ), #extract month from start date
year_trip = year(started_at), #extract year from start date
week_day_trip = wday(started_at,label = TRUE, abbr = FALSE )) #extract weekday from start date
library(ggplot2)
library(scales)
# Identifying outliers in duration using a histogram
ggplot(trips_data_2, aes(x=duration_trip)) +
ggtitle("Histogram plot for trip duration with outliers") +
xlab ("Trip duration in mins") +
geom_histogram(binwidth=1, color="black", fill="white") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
rm(trips_data_1)
Step 5: Removing extreme outliers and arriving at the final dataset for analysis
#install.packages("rstatix")
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
outliers<- trips_data_2 %>%
identify_outliers(duration_trip) #Identifying extreme outliers
outliers <- outliers[c("ride_id","is.extreme")] #Keeping column with ride_ids and extreme outliers
trips_data_3 <- merge(x = trips_data_2, y = outliers, by = "ride_id", all.x = TRUE) #Merging using left join
#Final dataset for analysis
trips_data_4 <- trips_data_3 %>%
mutate(is.extreme = ifelse(is.na(is.extreme), "FALSE", is.extreme)) %>% #replacing the missing values with FALSE
filter(!duration_trip <= 1 & # Excluding rides with a duration of less than 1 minutes or duration in negative, as it is assumed these rides were not completed or the date times were not correctly entered
is.extreme %in% c("FALSE")) # Excluding the extreme outliers assuming data processing errors
# Histogram without outliers
ggplot(trips_data_4, aes(x=duration_trip)) +
ggtitle("Histogram plot for trip duration without extreme outliers") +
xlab ("Trip duration in mins") +
geom_histogram(binwidth=1, color="black", fill="white") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
rm(outliers)
rm(trips_data_2)
rm(trips_data_3)
Step 5: Analyse the data
The following analysis was conducted to observe the ride patterns between member vs casual riders:
Table 1: Summary of all ride characteristics
#install.packages("gtsummary")
library(gtsummary)
trips_data_4 %>%
select(bike_type, member_casual, duration_trip, day_trip, month_trip, week_day_trip) %>% #selecting vars for summary
tbl_summary(
label = list(bike_type = "Type of bike", member_casual = "Subscription status",
duration_trip = "Trip duration (mins)", 'day_trip' = "Day of the trip",
'month_trip' = "Month of the trip", 'week_day_trip' = "Week of the trip",
'year_trip' = "Year of the trip"), #assigning variable labels to display
type = all_continuous() ~ "continuous2", #summary stat to continue on another row
statistic = list(all_continuous() ~ c("{mean} ({sd})", "{median} ({p25} - {p75})","{min} - {max}")), #stats
digits = (list(all_categorical() ~ c(0, 1))) #display percentage with 1 decimal
)
| Characteristic | N = 3,966,7871 |
|---|---|
| Type of bike | |
| Â Â Â Â Docked bike | 0 (0.0%) |
| Â Â Â Â Classic bike | 2,640,104 (66.6%) |
| Â Â Â Â Electric bike | 1,326,683 (33.4%) |
| Subscription status | |
| Â Â Â Â Member | 2,637,277 (66.5%) |
| Â Â Â Â Casual | 1,329,510 (33.5%) |
| Trip duration (mins) | |
| Â Â Â Â Mean (SD) | 13 (10) |
| Â Â Â Â Median (Q1 - Q3) | 10 (6 - 17) |
| Â Â Â Â Min - Max | 1 - 55 |
| Day of the trip | |
| Â Â Â Â Mean (SD) | 16 (9) |
| Â Â Â Â Median (Q1 - Q3) | 15 (8 - 23) |
| Â Â Â Â Min - Max | 1 - 31 |
| Month of the trip | |
| Â Â Â Â January | 108,627 (2.7%) |
| Â Â Â Â February | 177,318 (4.5%) |
| Â Â Â Â March | 219,617 (5.5%) |
| Â Â Â Â April | 279,705 (7.1%) |
| Â Â Â Â May | 411,192 (10.4%) |
| Â Â Â Â June | 466,165 (11.8%) |
| Â Â Â Â July | 508,860 (12.8%) |
| Â Â Â Â August | 512,251 (12.9%) |
| Â Â Â Â September | 475,586 (12.0%) |
| Â Â Â Â October | 383,021 (9.7%) |
| Â Â Â Â November | 263,419 (6.6%) |
| Â Â Â Â December | 161,026 (4.1%) |
| Week of the trip | |
| Â Â Â Â Sunday | 512,155 (12.9%) |
| Â Â Â Â Monday | 527,302 (13.3%) |
| Â Â Â Â Tuesday | 557,795 (14.1%) |
| Â Â Â Â Wednesday | 602,352 (15.2%) |
| Â Â Â Â Thursday | 579,634 (14.6%) |
| Â Â Â Â Friday | 572,646 (14.4%) |
| Â Â Â Â Saturday | 614,903 (15.5%) |
| 1 n (%) | |
#Pie Chart to show the distribution of member and casual riders
# Proportion table for grouped bar chart
tab1 <- table(trips_data_4$member_casual) #frequency table
prop1 <- as.data.frame(prop.table(tab1)) #proportion with denominator as member_casual var
prop1$percent <- as.numeric(format(round(prop1$Freq*100,1),nsmall=1)) # percentage calculation
colnames(prop1) <- c("member_casual","prop","percent") #renaming col names
# Pie chart
p1 <- ggplot(data=prop1, aes(x="", y=percent, fill=member_casual)) +
geom_col(color = "black") +
coord_polar(theta = "y")+
geom_text(aes(label = percent), size = 3, position = position_stack(vjust = 0.5), vjust = -1, color = "black")
p1 + theme_void() + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49"))
# Calculate the total trip duration
trips_data_4 %>%
select(member_casual, duration_trip) %>% #selecting vars for summary
summarise(sum_duration = sum(duration_trip)
)
## sum_duration
## 1 51900904
# Calculate the total trip duration by subscription status
trips_data_4 %>%
select(member_casual, duration_trip) %>% #selecting vars for summary
group_by(member_casual) %>%
summarise(sum_duration = sum(duration_trip)
)
## # A tibble: 2 × 2
## member_casual sum_duration
## <fct> <dbl>
## 1 Member 30609442.
## 2 Casual 21291462.
Table 2: Comparison of the Type of Bike Between Members and Casual Riders
trips_data_4 %>%
select(bike_type, member_casual) %>% #selecting vars for summary
tbl_summary(
by = member_casual, #grouping variable
label = list(bike_type = "Type of bike", member_casual = "Subscription status"), #assigning variable labels
digits = list(all_categorical() ~ c(0, 1)), #display percentage with 1 decimal
) %>%
add_p() # Add chi-square/fishers exact p value
## The following errors were returned during `add_p()`:
## ✖ For variable `bike_type` (`member_casual`) and "p.value" statistic: The
## package "cardx" (>= 0.2.1) is required.
| Characteristic | Member N = 2,637,2771 |
Casual N = 1,329,5101 |
p-value |
|---|---|---|---|
| Type of bike | |||
| Â Â Â Â Docked bike | 0 (0.0%) | 0 (0.0%) | |
| Â Â Â Â Classic bike | 1,786,184 (67.7%) | 853,920 (64.2%) | |
| Â Â Â Â Electric bike | 851,093 (32.3%) | 475,590 (35.8%) | |
| 1 n (%) | |||
# Proportion table for grouped bar chart
tab1 <- table(trips_data_4$member_casual, trips_data_4$bike_type) #frequency table
prop1 <- as.data.frame(prop.table(tab1, margin = 1)) #proportion with denominator as member_casual var
prop1$percent <- as.numeric(format(round(prop1$Freq*100,1),nsmall=1)) # percentage calculation
colnames(prop1) <- c("member_casual","bike_type","prop","percent") #renaming col names
# Grouped bar chart
p <- ggplot(data=prop1, aes(x=bike_type, y=percent, fill=member_casual)) +
geom_bar(stat="identity", position=position_dodge()) +
ylim (0,100) +
xlab("Type of the bike") +
ylab ("Percentage") +
ggtitle("Distribution of usage by Bike Type") +
geom_text(aes(label = percent), size = 3, position = position_dodge(width = .9), vjust = -1, color = "black")
p + theme_classic() + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) +
annotate(geom = "text", x=3, y=90, label = "Pearson’s Chi-squared test \n p<0.001")
Table 3: Comparison of Trip Duration Between Members and Casual Riders
# Histogram to check the distribution between member vs casual riders
p<- ggplot(trips_data_4, aes(x=duration_trip,fill=member_casual, color = member_casual)) +
geom_histogram(alpha = 0.5, position="identity", binwidth=1) +
theme_classic() +
labs(title = "Distribution plot of Trip Duration (mins)") +
xlab ("Trip duration (mins)") +
ylab ("Count")
#p2 + labs(fill = "Subscription status",
# color = "Subscription status")
p + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) +
scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49")) +
scale_x_continuous(breaks = seq(0, 60, by = 10))
Since the distribution for both the groups seems to be positively
skewed, we use box plot for graphical representation of the trip
duration between member vs casual riders.
trips_data_4 %>%
select(duration_trip, member_casual) %>% #selecting vars for summary
tbl_summary(
by = member_casual, #grouping variable
label = list(duration_trip = "Trip duration (mins)", member_casual = "Subscription status"),#assigning variable labels
type = all_continuous() ~ "continuous2", #each summary stat to continue on another row
statistic = list(all_continuous() ~ c("{mean} ({sd})", "{median} ({p25} - {p75})","{min} - {max}"))
) %>%
add_p() # add p value for wilcoxon signed rank test
## The following errors were returned during `add_p()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
## package "cardx" (>= 0.2.1) is required.
| Characteristic | Member N = 2,637,277 |
Casual N = 1,329,510 |
p-value |
|---|---|---|---|
| Trip duration (mins) | |||
| Â Â Â Â Mean (SD) | 12 (9) | 16 (12) | |
| Â Â Â Â Median (Q1 - Q3) | 9 (5 - 15) | 12 (7 - 21) | |
| Â Â Â Â Min - Max | 1 - 55 | 1 - 55 |
# Boxplot
p <- ggplot(trips_data_4, aes(x=member_casual, y=duration_trip, fill=member_casual)) +
stat_boxplot(geom ='errorbar' , width = 0.5) +
geom_boxplot(width = 0.5, outlier.colour="black", outlier.shape = 20, outlier.size=1)+
labs(title="Trip Duration" , x="Subscription status", y = "Trip duration (mins)")
p + theme_classic() + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) + scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49")) +
stat_summary(fun=mean, geom = "point", shape = 2, size = 2, color = "black") +
annotate(geom = "text", x=2.3, y=50, label = "Wilcoxon rank \n sum test \n p<0.001")
Table 4: Compare the average ride duration by each day of the week for
members vs casual users
trips_data_4 %>%
select(duration_trip, week_day_trip, member_casual) %>% #selecting vars for summary
tbl_strata( strata = week_day_trip, # Stratifying variable
.tbl_fun =
~ .x %>%
tbl_summary(
by = c(member_casual), #grouping variable
label = list(duration_trip = "Trip duration (mins)", member_casual = "Subscription status"),
type = all_continuous() ~ "continuous2",
statistic = list(all_continuous() ~ c("{mean} ({sd})", "{median} ({p25} - {p75})","{min} - {max}"))
) %>%
add_p(), #add p value for wilcoxon signed rank test
.header = "**{strata}**, N = {n}"
)
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
## package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
## package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
## package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
## package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
## package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
## package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
## package "cardx" (>= 0.2.1) is required.
| Characteristic | Sunday, N = 512155 | Monday, N = 527302 | Tuesday, N = 557795 | Wednesday, N = 602352 | Thursday, N = 579634 | Friday, N = 572646 | Saturday, N = 614903 | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Member N = 288,591 |
Casual N = 223,564 |
p-value | Member N = 376,567 |
Casual N = 150,735 |
p-value | Member N = 411,476 |
Casual N = 146,319 |
p-value | Member N = 435,797 |
Casual N = 166,555 |
p-value | Member N = 415,366 |
Casual N = 164,268 |
p-value | Member N = 374,790 |
Casual N = 197,856 |
p-value | Member N = 334,690 |
Casual N = 280,213 |
p-value | |
| Trip duration (mins) | |||||||||||||||||||||
| Â Â Â Â Mean (SD) | 13 (10) | 18 (12) | 11 (9) | 15 (12) | 11 (9) | 14 (11) | 11 (9) | 15 (11) | 11 (9) | 14 (11) | 11 (9) | 16 (12) | 13 (10) | 18 (12) | |||||||
| Â Â Â Â Median (Q1 - Q3) | 10 (6 - 17) | 14 (8 - 24) | 8 (5 - 14) | 12 (7 - 20) | 9 (5 - 14) | 11 (7 - 19) | 9 (5 - 15) | 11 (7 - 19) | 9 (5 - 14) | 11 (7 - 19) | 9 (5 - 14) | 12 (7 - 21) | 10 (6 - 17) | 15 (9 - 24) | |||||||
| Â Â Â Â Min - Max | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | 1 - 55 | |||||||
#Calculating average duration for each week day between member and casual riders
t1 <- trips_data_4 %>%
group_by(member_casual, week_day_trip) %>% #groups by subscription status and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(duration_trip),
sd_duration = sd(duration_trip)) %>% # calculates the average duration
arrange(member_casual, week_day_trip)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
# Column Bar chart
#Plot of Average Trip Duration by Day of the Week Between Member vs Casual Riders
p <- ggplot(t1, aes(x=week_day_trip, y=average_duration, group=member_casual)) +
geom_line(aes(color=member_casual))+
geom_point(aes(color=member_casual)) +
#geom_errorbar(aes(y=), width=.2,
# position=position_dodge(0.05)) +
labs(title="Average Trip Duration by Day of the Week" ,
x="Day of the Week", y = "Trip duration (mins)") +
ylim(0,20)
p + theme_classic() + scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49"))
#+ annotate(geom = "text", x=2.3, y=50, label = "Wilcoxon rank \n sum test \n p<0.001")
#Plot of Number of Rides by Day of the Week Between Member vs Casual Riders
p <- ggplot(t1, aes(x=week_day_trip, y=number_of_rides, fill=member_casual)) +
geom_col(position="dodge")+
labs(title="Number of Rides by Day of the Week" ,
x="Day of the Week", y = "Number of rides")
p + theme_classic() + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
Next, we will visualize the average number of rides taken at the specific time of the day.
#Calculating average number of rides between member and casual riders by time of the day
t2 <- trips_data_4 %>%
group_by(member_casual, hour = hour(time_trip)) %>% #groups by subscription status and every hour
mutate(number_of_rides = n()) %>%
ungroup()
p<-ggplot(t2, aes(x=time_trip, y=number_of_rides, group=member_casual)) +
geom_line(aes(color=member_casual)) + theme_classic() #+ geom_point(aes(color=member_casual))
breaks <- c(0, 6,12,18,24)*3600
labels <- format(as.POSIXct(breaks, tz = "UTC"), format = "%H:%M")
p + scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49")) +
labs(title="Number of Rides by Time of the Day" ,
x="Time of the Day", y = "Number of rides") +scale_x_time(breaks = breaks, labels = labels)
In the following analysis, we will visualize the most frequent start and end stations based on the number of rides, and explore how this distribution varies between member and casual riders.
Plot x: Visualization for the most common start stations by subscription status
stn_name_final <- trips_data_4 %>%
group_by(start_station_id) %>% #groups by start_station_id
mutate(start_station_name_new = names(which.max(table(start_station_name)))) %>%
ungroup() %>%
group_by(end_station_id) %>% #groups by start_station_id
mutate(end_station_name_new = names(which.max(table(end_station_name))))
#Getting the one set of coordinate for each station id
station_info <- stn_name_final %>%
dplyr::group_by(start_station_id, start_station_name_new) %>% #groups by start_station_id
mutate (start_lat_final = mean(start_lat), start_lng_final = mean(start_lng)) %>%
dplyr::ungroup() %>%
dplyr::group_by(end_station_id, end_station_name_new) %>% #groups by end_station_id
mutate (end_lat_final = mean(end_lat), end_lng_final = mean(end_lng)) %>%
dplyr::ungroup()
#station_data <- trips_data_4 %>%
# distinct(start_station_name, start_station_id)
#station_dup <- station_data[duplicated(station_data$start_station_id), ]
# calculating the number of rides for each start station taken by member and casual riders
start_station_unique <- station_info %>% #filter(member_casual == "Casual") %>%
group_by(member_casual, start_station_id, start_station_name_new, start_lat_final, start_lng_final) %>%
summarise(number_of_rides = n()) %>% #calculates the number of rides
ungroup()
end_station_unique <- station_info %>% #filter(member_casual == "Casual") %>%
group_by(member_casual, end_station_id, end_station_name_new, end_lat_final, end_lng_final) %>%
summarise(number_of_rides = n()) %>% #calculates the number of rides
ungroup()
start_station_unique <- start_station_unique %>%
mutate(numBins = ifelse(number_of_rides <= 10000, number_of_rides,10000))
start_station_unique <- start_station_unique %>%
mutate(numBins = case_when(number_of_rides <= 2500 ~ "<2500",
number_of_rides > 2500 & number_of_rides <= 7500 ~ "2500 - 7500",
number_of_rides > 7500 & number_of_rides <= 10000 ~ "7500 - 10000",
number_of_rides > 10000 ~ ">10000"))
#install.packages("plotly")
library(plotly)
library(sf)
shape <- read_sf(dsn = "C:/Users/Smrut/OneDrive/Documents/Raw Data/Data",
layer = "geo_export_36491657-9a1b-46a0-a113-872487cb2834")
p <- ggplot(shape) +
geom_sf(fill = "grey", color = "white")
# Visualization for most common start points for rides taken by member vs casual riders
q <- p + geom_point(data=start_station_unique ,aes(x=start_lng_final,y=start_lat_final, color=numBins,
text = paste("Start station:", start_station_name_new, "\n", "No. of rides:", number_of_rides)),size=1,alpha=0.75) + xlab("Longitude") + ylab("Latitude") + ggtitle("Distribution of Starting Stations with Number of Rides in Chicago") +
facet_grid(. ~ member_casual, axes = "all") +
scale_colour_brewer(palette = "RdYlGn", direction = -1, name = "Number of Rides")
ggplotly(q, source = "select", tooltip = c("text"))
Plot x: Visualization for the most common end stations by subscription status
# Visualization for most common start points for rides taken by member vs casual riders
q <- p + geom_point(data=end_station_unique ,aes(x=end_lng_final,y=end_lat_final, color=number_of_rides,
text = paste("End station:", end_station_name_new, "\n", "No. of rides:", number_of_rides)),size=1,alpha=0.75) + xlab("Longitude") + ylab("Latitude") + ggtitle("Distribution of Ending Stations with Number of Rides in Chicago") +
facet_grid(. ~ member_casual, axes = "all") +
scale_colour_gradient(high="red",low='green', name = "Number of Rides")
ggplotly(q, source = "select", tooltip = c("text"))
### SHARE